Information Extraction from Web Documents Based on Local Unranked Tree Automaton Inference

Authors

  • Raymond Kosala
  • Maurice Bruynooghe
  • Jan Van den Bussche
  • Hendrik Blockeel
Abstract

Information extraction (IE) aims at extracting specific information from a collection of documents. A lot of previous work on IE from semi-structured documents (in XML or HTML) uses learning techniques based on strings. Some recent work converts the document to a ranked tree and uses tree automaton induction. This paper introduces an algorithm that uses unranked trees to induce an automaton. Experiments show that this gives the best results obtained so far for IE from semi-structured documents based on learning.

Similar resources

Rewrite-Based Verification of XML Updates

We propose a model for XML update primitives of the W3C XQuery Update Facility as parameterized rewriting rules of the form: "insert an unranked tree from a regular tree language L as the first child of a node labeled by a". For these rules, we give type inference algorithms, considering types defined by several classes of unranked tree automata. These type inference algorithms are directly app...

Wrapper Induction: Learning (k,l)-Contextual Tree Languages Directly as Unranked Tree Automata

A (k, l)-contextual tree language can be learned from positive examples only; such languages have been successfully used as wrappers for information extraction from web pages. This paper shows how to represent the wrapper as an unranked tree automaton and how to construct it directly from the examples instead of using the (k, l)-forks of the examples. The former speeds up the extraction, the la...

On Probability Distributions for Trees: Representations, Inference and Learning

We study probability distributions over free algebras of trees. Probability distributions can be seen as particular (formal power) tree series [BR82; EK03], i.e. mappings from trees to a semiring K. A widely studied class of tree series is the class of rational (or recognizable) tree series which can be defined either in an algebraic way or by means of multiplicity tree automata. We argue that ...

Learning (k, l)-Contextual Tree Languages for Information Extraction

Learning regular languages from positive examples only is known to be infeasible. A common solution is to define a learnable subclass of the regular languages. In the past, this has been done for regular string languages. Using ideas from those techniques, we define a learnable subclass of regular unranked tree languages, called the (k,l)-contextual tree languages. We describe the use of this s...

Information extraction from structured documents using k-testable tree automaton inference

Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work on IE from structured documents, such as HTML or XML, uses learning techniques that are based on strings, such as finite automata induction. These methods do not exploit the tree structure of the documents. A natural way to do this is to induce tree auto...

Publication date: 2003